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NEURAL NETWORK POST -PROCESSOR 

The present invention relates to a neural network 
post-processor and more particularly to such a processor 
incorporated within a recogniser portion of an automated 
speech recognition system. 

BACKGROUND 

Automated speech recognition is a complex task in 
itself. Automated speech understanding sufficient to 
provide automated dialogue with a user adds a further layer 
of complexity. 

In this specification the term ''automated speech 
recognition system" will refer to automated or 
substantially automated systems which perform automated 
speech recognition and also attempt automated speech 
understanding, at least to predetermined levels sufficient 
to provide a capability for at least limited automated 
dialogue with a user. 

A generalized diagram of a commercial grade automated 
speech recognition system as can be used for example in 
call centres and the like is illustrated in Fig. 1. 

With advances in digital computers and a significant 
lowering in cost per unit of computing capacity there have 
been a number of attempts in the commercial marketplace to 
install such automated speech recognition systems 
implemented substantially by means of digital computers. 
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However, to date, there remain problems in achieving 
100% recognition and/or 100% understanding in real time. 

In one particular form critical to the success or 
otherwise of any given recognition schema there are 
5 difficulties in classifying patterns as correctly 
recognized or incorrectly recognized/not modeled. 

It is an object of the present invention to address or 
ameliorate one or more of the abovementioned disadvantages. 

10 BRIEF DESCRIPTION OF INVENTION 

Accordingly, in one broad form of the invention there 
is provided in a speech recognition system of the type 
adapted to process utterances from a caller or user by way 
15 of a recogniser, an utterance processing system and a 
dialogue processing system so as to produce responses to 
said utterances, a method of gauging correctness of a 
pattern recognition task by mapping an input feature space 
onto a probability space. 
20 Preferably said mapping is a non- linear mapping. 

Preferably said method is applied to a confidence 
scoring task. 

Preferably said method utilizes a multi-layer 
perceptron to apply multiple knowledge sources to said 
25 confidence scoring task. 

In a further broad form of the invention there is 
provided in a speech recognition system of the type adapted 
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to process utterances from a caller or user by way of a 
recogniser, an utterance processing system and a dialogue 
processing system so as to produce responses to said 
utterances, a method of obtaining a confidence score by a 
5 non-linear mapping onto the real number line. 

Preferably said method utilizes an MLP to generate an 
aposteriori probability for confidence. 

Preferably said MLP is trained with a mean squared 
error. 

10 Preferably said MLP is trained utilizing a cross- 

entropy cost function. 

Preferably said MLP is additionally trained with some 
sigmoidal non-linearity. 

In yet a further broad form of the invention there is 

15 provided in a speech recognition system of the type adapted 
to process utterances from a caller or user by way of a 
recogniser, an utterance processing system and a dialogue 
processing system so as to produce responses to said 
utterances, a method of confidence scoring utilizing a data 

20 driven system. 

BRIEF DESCRIPTION OF DRAWINGS 



Embodiments of the present invention will now be 
25 described with reference to the accompanying drawings 
wherein: 
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Fig. 1 is a generalized block diagram of a prior art 

automated speech recognition system; 

Fig. 2 is a generalized block diagram of an automated 

speech recognition system suited for use in conjunction 

with an embodiment of the present inventions- 
Fig. 3 is a more detailed block diagram of the 

utterance, processing and dialogue processing portions of 

the system of Fig. 2; 

Fig. 4 is a block diagram of the system of Fig. 2 

incorporating a neural network post -processor in accordance 

with a first embodiment of the present invention. 

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 

With reference to Fig. 2 there is illustrated a 
generalized block diagram of an automated speech 
recognition system 10 adapted to receive human speech 
derived from user 11, and to process that speech with a 
view to recognizing and understanding the. speech to a 
sufficient level of accuracy that a response 12 can be 
returned to user 11 by system 10. In the context of 
systems to which embodiments of the present invention are 
applicable the response 12 can take the form of an auditory 
communication, a written or visual communication or any 
other form, of communication intelligible to user 11 or a 
combination thereof. 
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In all cases input from user 11 is in the form of a 
plurality of utterances 13 which are received by transducer 
14 {for example a microphone) and converted into an 
electronic representation 15 of the utterances 13. In one 
5 exemplary form the electronic representation 15 comprises a 
digital representation of the utterances 13 in .WAV format. 
Each electronic representation 15 represents an entire 
utterance 13. The electronic representations 15 are 
processed through front end processor 16 to produce a 

10 stream of vectors 11, one vector for example for each 10ms 
segment of utterance 13. The vectors 17 are matched 
against knowledge base vectors 18 derived from knowledge 
base 19 by back end processor 20 so as to produce ranked 
results 1-N in the form of N best results 21. The results 

15 can comprise for example subwords, words or phrases but 
will depend on the application. N can vary from 1 to very 
high values, again depending on the application. 

An utterance processing system 26 receives the N best 
results 21 and begins the task of assembling the results 

20 into a meaning representation 25 for example based on the 
data contained in language knowledge database 31. 

The utterance processing system 26 orders the 
resulting tokens or words 23 contained in N-best results 21 
into a meaning representation 25 of token or word 

25 candidates which are passed to the dialogue processing 
system 2 7 where sufficient understanding is attained so as 



1 



wo 02/103673 



PCT/AU02/00803 



to permit functional utilization of speech input 15 from 
user 11 for the task to be performed by the automated 
speech recognition system 10. In this case the 

functionality includes attaining of sufficient 
5 understanding to permit at least a limited dialogue to be 
entered into with user/caller 11 by means of response 12 in 
the form of prompts so as to elicit further speech input 
from the user 11. In the alternative or in addition, the 
functionality for example can include a sufficient 

10 understanding to permit interaction with extended databases 
for data identification. 

Fig. 3 illustrates further detail of the system of 
Fig. 2 including listing of further functional components 
which make up the utterance processing system 26 and the 

15 dialogue processing system 27 and their interaction. Like 
components are numbered as for the arrangement of Fig. 2. 

The utterance processing system 26 and the dialogue 
processing system 27 together form a natural language 
processing system. The utterance processing system 26 is 

20 event -driven and processes each of the utterances 13 of 
caller/user 11 individually. The dialogue processing 
system 27 puts any given utterance 13 of caller/user 11 
into the context of the current conversation (usually in 
the context of a telephone conversation) . Broadly, in a 

25 telephone answering context, it will try to resolve the 
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query from the caller and decide on an appropriate answer 
to be provided by way of response 12 . 

The utterance processing system 26 takes as input the 
output of the acoustic or speech recogniser 30 and produces 
5 a meaning representation 25 for passing to dialogue 
processing system 27. 

In a typical, but not limiting form, the meaning 
representation 25 can take the form of value pairs. For 
example, the utterance want to go from Melbourne to 

10 Sydney on Wednesday" may be presented to the dialogue 
processing system 27 in the form of three value pairs, 
comprising: 

1 . St art ; Me Ibourne 

2. Destination; Sydney 
15 3. Date; Wednesday 

where, in this instance, the components Melbourne, Sydney, 
Wednesday of the value pairs 24 comprise tokens or words 
23, 

With particular reference to Fig. 3 the recogniser 30 
20. provides as output N best results 21 usually in the form of 
tokens or words 23 to the utterance processing system 26 
where it is first disambiguated by language model 32. In 
one form the language model 32 is based on trigrams with 
cut off. 

25 Analyser 33 specifies how words derived from language 

model 32 can be grouped together to form meaningful phrases 
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which are used to interpret utterance 13 . In one form the 
analyzer is based on a series of simple finite state 
automata which produce robust parses of phrasal chunks - 
for example noun phrases for entities and concepts and WH- 
5 phrases for questions, dates. . Analyser '33 is driven by 
grammars such as meta-grammar 34. The grammars themselves 
must be tailored for each application and can be thought of 
as data created for a specific customer. 

The resolver 35 then uses semantic information 
10 associated with the words of the phrases recognized as 
relevant by the analyzer 33 to refine the meaning 
representation 25 into its final form for passing through 
the dialogue flow controller 36 within dialogue processing 
system 27. 

15 The dialogue processing system 27, in this instance 

with reference to Fig. 3, receives meaning representation 
25 from resolver 35 and processes the dialogue according to 
the appropriate dialogue models. Again, dialogue models 
will be specific to different applications but some may be 

20 reusable. For example a protocol model may handle 
greetings, closures, interruptions, errors and the like 
across a number of different applications. 

The dialogue flow controller 36 uses the dialogue 
history to keep track of the interactions. 

25 The logic engine 37, in this instance, creates SQL 

queries based on the meaning representation 25. Again it 
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will be dependent on the specific application and its 
domain knowledge base, 

The generator 38 produces responses 12 (for example 
speech out) . In the simplest form the generator 38 can 
5 utilize generic text to speech (TTS) systems to produce a 
voiced response. 

Language knowledge database 31 comprises, in the 
instance of Fig. 3, a lexicon 3 9 operating in conjunction 
with database 40. The lexicon 39 and database 40 operating 

10 in conjunction with knowledge base mapping tools 41 and, as 
appropriate, language model 32 and grammars 34 constitutes 
a language .knowledge database 31 or knowledge base which 
deals with domain specific data. The structure and 
grouping of data is modeled in the knowledge base 31. 

15 Database 40 conprises raw data provided by a customer. 

In one instance this data may comprise names, addresses, 
places, dates and is usually organised in a way that 
logically relates to the way it will be used. The database 
40 may remain unchanged or it may be updated throughout the 

20 lifetime of an application. Functional implementation can 
be by way of database servers such as MySQL, Oracle, 
Postgres . 

As will be observed particularly with reference to 
Fig. 3, interaction between a number of components in the 
25 system can be quite complex with lexicon 39, in particular. 



wo 02/103673 



PCT/AU02/00803 



being used by and interacting with multiple components of 
System 10, 

With reference to Fig. 4 there is shown in block 
diagram form a neural network post -processor 410 in 
5 accordance with a first preferred- embodiment of the present 
invention. 

In this instance the processor 410 is applied to the 
output of recogniser 30. 

Broadly neural network post -processor 410 utilises a 
10 multi- layer perceptron to apply multiple knowledge sources 
to the problem. This performs a non-linear mapping onto 
the real number line to give us a confidence score. 

This differs from previous solutions in several ways: 
1. The application of multiple knowledge sources 
15 2, The use of an MLP to generate an aposteiriori 

probability for confidence. 

This solution can be implemented simply using an MLP 
trained with either a mean squared error or a cross -entropy 
cost function, and some sigmoidal non-linearity- Our 
20 experimental system was trained using conjugate gradient 
descent (back-propagation) . 

The system 10 incorporating processor 410 is data 
driven, rather than based on some heuristic technicjue/ as 
such (for a representative corpus) . It 'learns' an optimal 
25 mapping from input data to a correct/incorrect mapping. 
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In effect, in order to gauge end best results 21 
derived from recogniser 30 the neural network post- 
processor 410 non- linearly maps an input feature space of a 
pattern recognition task onto the probability space for 
gauging correctness as applied to confidence scoring. 

The above describes only some embodiments of the 
present invention and modifications, obvious to those 
skilled in the art, can be made thereto without departing 
from the scope and spirit of the present invention. 
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CLAIMS 

1. In a speech recognition system of the type adapted to 
process utterances from a caller or user by way of a 
recogniser, an utterance processing system and a 
dialogue processing system so as to produce responses 
to said utterances, a method of gauging correctness of 
a pattern recognition task by mapping an input feature 
space onto a probability space . 

2 • The method of Claim 1 wherein said mapping is a non- 
linear mapping. 

3 . The method of Claim 1 or Claim 2 applied to a 
confidence scoring task. 

4. The method of any previous claim utilizing a multi- 
layer perceptron to apply multiple knowledge sources 
to said confidence scoring task. 

5. In a speech recognition system of the type adapted to 
process utterances from a caller or user by way of a 
recogniser, an utterance processing system and a 
dialogue processing system so as to produce responses 
to said utterances, a method of obtaining a confidence 
score by a non- linear mapping onto the real number 
line . 
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6. The method of any previous claim utilizing an MLP to 
generate an aposteriori probability for confidence. 

7. The method of Claim 6 wherein said MLP is trained with 
a mean squared error. 

5 8. The method of Claim 6 wherein said MLP is trained 
utilizing a cross-entropy cost function. 

9. The method of Claim 7 and Claim 8 wherein said MLP is 
additionally trained with some sigmoidal non- 
linearity. 

10 10. In a speech recognition system of the type adapted to 
process utterances from a caller or user by way of a 
recogniser, an utterance processing system and a 
dialogue processing system so as to produce responses 
to said utterances, a method of confidence scoring 

15 utilizing a data driven system. 

11. A speech recognition system operating according to the 
method of any previous claim. 



20 
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AMENDED CLAIMS 

[received by the International Bureau on 1 1 November 2002 (1 1.1 1.02); 
original claims 1-11 replaced by new claims 1-17] 

1. In a speech recognition system of the type adapted to 
process utterances from a caller or user by way of a 
recogniser^ an utterance processing system and a 
dialogue processing system so as to produce responses 
to said utterances, a method of gauging correctness of 
a pattern recognition task by mapping an input feature 
space onto a probability space. 

2. The method of Claim 1 wherein said mapping is a non- 
linear mapping* 

3. The method of Claim 1 or Claim 2 applied to a 
confidence scoring task, 

4. The method of any previous claim utilizing a multi- 
layer perceptron (MLP) to apply multiple knowledge 
sources to said confidence scoring task. 

5. In a speech recognition system of the type adapted to 
process utterances from a caller or user by way* of a 
recogniser/ an utterance processing system and a 
dialogue processing system so as to produce responses 
to said utterances, a method of obtaining a confidence 
score by a non-linear mapping onto the real number 
line. 
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6. The method of any previous claim utilizing an MLP to 
generate an aposteriori probability for confidence. 

7. The method of Claim 6 wherein said MLP is trained with 
a mean squared error « 

5 8. The method of Claim 6 wherein said MLP is trained 
utilizing a cross-entropy cost function. 

9. The method of Claim 7 and Claim 8 wherein said MLP is 
additionally trained with some sigmoidal non- 
linearity. 

10 10. In a speech recognition system of the type adapted to 
process utterances from a caller or user by way of a 
recogniser, an utterance processing system and a 
dialogue processing system so as to produce responses 
to said utterances^ a method of confidence scoring 

15 utilizing a data driven system. 

11. A speech recognition system operating according to the 
method of any previous claim, 

12 • In a speech recognition system of the type adapted to 
process utterances from a caller or user by way of a 
20 recogniser, an utterance processing " system and a 

dialogue processing system so as to produce responses, 
to said utterances, a method of gauging correctness of 
a pattern recognition task by mapping an input feature 
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space onto a probability space; and wherein said 
mapping is a non^linear mapping; and wherein said 
method is applied to a confidence scoring task; said 
method utilizing a multi-layer perceptron to apply 
multiple knowledge sources to said confidence scoring 
task; said method utilizing said multi-layer 
perceptron to generate an aposteriori probability for 
confidence and wherein said multi-layer perceptron is 
trained with a mean squared error; said multi-layer 
perceptron also trained utilizing a cross'-entropy cost 
function. 

13. The method of Claim 12 utilising a data driven system. 

14. The method of Claim 12 or 13 wherein said multi-layer 
perceptron is additionally trained with some sigmoidal 
non-linearity . 

15. A speech recognition system operating according to the 
method of any one of Claims 12 to 14. 

16. In a speech recognition system of the type adapted to 
process utterances from a caller or user by way of a 
recogniser, an utterance processing system and a 
dialogue processing system so as to produce responses 
to said utterances / a method of gauging correctness of 
a pattern recognition task by mapping an input feature 
space onto a probability space; and wherein said 
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mapping is a non-linear mapping; and wherein said 
method is applied to a confidence scoring task; said 
method utilizing a multi-layer perceptron to apply 
multiple knowledge sources to said confidence scoring 

5 task; said method utilizing said multi-layer 

perceptron to generate an aposteriori probability for 
confidence arid wherein said multi-layer perceptron is 
trained with a mean squared error; said multi-layer 
perceptron also trained utilizing a cross-entropy cost 

10 function; and wherein said multi-layer perceptron is 

additionally trained with some sigmoidal non- 
linearity. 

17- A speech recognition system operating according to the 
method of Claim 16, 
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